Abstract

The aim of this paper is to investigate to what extent
non-native speech may degrade language identification
(LID) performance, and to improve it
using  acoustic  adaptation.  Our  reference  LID  sys- 
tem is  based  on  a phonotactic  approach.  The  system 
makes  use  of  language-independent  acoustic  models 
and  language-specific  phone-based  bigram  language 
models.  Experiments  are  conducted  on  the  SQALE 
test  database,  which  contains  recordings  from  En- 
glish, French  and  German  native  speakers,  and  on  the 
MIST  database,  which  contains  non-native  speech  in 
the same languages uttered by Dutch speakers. Using
5 seconds of telephone-quality speech, the language
identification error rate amounts to 10% for native
speech and to 28% for non-native speech, a substantial
increase in error rate in the non-native case.
We improve non-native language identification
by adapting the acoustic models to
the non-native speech.


1 INTRODUCTION 

In  the  field  of  automatic  speech  processing,  intensive 
research  activities  have  been  devoted  to  speech  recog- 
nition and  transcription.  With  the  growing  interest 
in  multilinguality  and  multilingual  systems,  language 
identification  (LID)  has  become  a research  area  of  its 
own [5,7]. In a multilingual context, however, speakers
may use foreign languages for communication. Under
such conditions, i.e. with non-native speech
input, system performance is known to decrease.
Yet systematic evaluations of this degradation, and
research efforts to minimize it, are still lacking.

Various information sources can be exploited in order
to identify a given language: acoustic, phonemic,
phonotactic,  lexical,  etc.  In  practice,  for  each  infor- 
mation level  specific  resources  and  corpora  are  re- 
quired for  the  languages  to  be  modeled,  and  in  most 
LID  approaches  only  acoustic-phonetic  and  phonotac- 
tic models  are  used.  The  models  are  usually  trained 
on  native  speech.  Given  the  much  greater  spectral 
variability  commonly  observed  in  non-native  speech, 
performance is expected to degrade when the models
are applied to such material.

Studying  the  impact  of  non-native  speech  on  LID 
requires  appropriate  test  material.  Ideally  a multilin- 
gual native  speaker  database  and  a multilingual  non- 
native speaker  database  are  required.  Both  corpora 
should  be  similar  in  style  and  recorded  in  compara- 
ble acoustic  conditions.  To  our  knowledge  the  MIST 
database  is  the  first  multi-lingual  corpus  gathering 
non-native  speech;  it  contains  recordings  in  English, 
French  and  German  from  Dutch  speakers.  Similar  na- 
tive speech  material  is  provided  by  the  multilingual 
corpora  produced  within  the  LE-SQALE  project  [6]. 

In  the  following,  we  describe  the  LID  system  used 
for  the  experiments.  We  present  baseline  LID  results 
on  native  speech  using  the  SQALE  test  database, 
and  results  on  non-native  speech  using  the  MIST 
database;  by  means  of  these  experiments  we  measure 
the  impact  of  native  versus  non-native  speech  on  LID 
error  rates.  Finally  we  investigate  the  effectiveness 
of  acoustic  model  adaptation  to  handle  non-native 
speech. 

2 LID  SYSTEM 

The  LID  system  used  in  the  experiments  is  based 
on  a phonotactic  approach,  with  a single  language- 
independent  acoustic-phonetic  decoder.  This  ap- 
proach was  chosen  because,  compared  to  language- 
specific  acoustic  modeling,  it  allows  easier  extension 
of the system to new languages, as there is no need
for specific phonetic knowledge of the new language
or for a phonetic labelling of the training databases.
The  drawback  is  that  it  generally  requires  longer  test 
segments  to  obtain  optimal  results  as  compared  to 
acoustic-phonetic approaches. Previous work showed
that the LID results of the phonotactic approach
improve significantly when the test segment length
goes from 10 s to 45 s [4].

The system is described in more detail in another
article [1], where it is referred to as LI_HC
(language-independent hierarchically clustered phone
set).  It  is  illustrated  in  Figure  1.  It  uses 
one  single  language-independent  phone  recognizer 
to  label  the  speech  input.  The  phone  sequence 
output  by  this  phone  recognizer  is  then  scored 
with  language-dependent  phonotactic  models  approx- 
imated by  phone  bigrams.  The  language  providing 
the  highest  phonotactic  probability  is  hypothesized. 
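
The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function names, the probability floor for unseen phone pairs and the toy bigram values are assumptions, not details of the LI_HC system.

```python
import math

def bigram_log_prob(phones, bigram, floor=1e-6):
    """Sum log P(cur | prev) over a phone sequence.

    Phone pairs missing from the model receive the (assumed) floor value.
    """
    total = 0.0
    for prev, cur in zip(phones, phones[1:]):
        total += math.log(bigram.get((prev, cur), floor))
    return total

def identify_language(phones, models):
    """Return the language whose bigram model scores the sequence highest."""
    return max(models, key=lambda lang: bigram_log_prob(phones, models[lang]))

# Toy, made-up bigram probabilities (not taken from the paper).
models = {
    "english": {("dh", "ax"): 0.30, ("ax", "k"): 0.10, ("k", "ae"): 0.20},
    "french": {("l", "ax"): 0.25, ("ax", "k"): 0.05, ("k", "a"): 0.20},
}
decoded = ["dh", "ax", "k", "ae"]
print(identify_language(decoded, models))
```

Working in the log domain avoids numerical underflow on long phone sequences.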

2.1  Training  database 

The  LID  system  was  trained  using  the  IDEAL  cor- 
pus, which  is  a multi-language  telephone  speech  cor- 
pus designed  to  support  research  on  LID  [3].  This 
corpus  contains  a large  amount  of  speech  (between  15 
and  18  hours  per  language).  The  different  languages 
were  collected  under  the  same  conditions,  and  na- 
tive speakers  were  recruited  in  their  home  countries. 
Data  have  been  recorded  for  British  English,  Spanish, 
French  and  German.  All  speakers  called  the  LIMSI 
data  collection  system  ensuring  the  same  recording 
conditions  for  the  entire  corpus.  The  IDEAL  corpus 
contains  about  300  calls  for  each  language  (i.e.,  in- 
ternational calls  from  native  U.K.,  Spanish,  and  Ger- 
man speakers  and  national  calls  from  native  French 
speakers),  250  of  them  being  used  for  acoustic  and 
phonotactic  model  estimation  (about  13  hours  per 
language).

The calling script was designed to cover a variety
of data types: 12 questions to elicit precise responses
(7  general  questions  concerning  the  call  and  caller, 
and  5 prompts  asking  for  times,  dates,  days  of  the 
week  and  months  of  the  year),  18  items  containing 
predefined  texts  to  read,  and  6 questions  aimed  at 
collecting  spontaneous  speech.  The  acoustic  models 
were  trained  on  all  types  of  material,  and  the  phono- 
tactic models  on  the  spontaneous  speech  part. 

2.2  Front-end  processing 

The front-end processing consists of 12 MFCCs plus
the energy, augmented by their first- and second-order
derivatives, i.e. a total of 39 coefficients every 10 ms.
The  same  setting  was  used  for  processing  test  data, 
except  that  signal  frequencies  over  3.5  kHz  were  cut 
in  order  to  be  consistent  with  the  training  database 
which  contains  only  narrow-band  telephone  speech. 
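
The count of 39 coefficients follows from 13 static values (12 MFCC plus energy) and their two orders of derivatives. The sketch below illustrates this bookkeeping; the simple central-difference formula and the edge handling are assumptions, since the exact derivative computation is not specified here.

```python
def deltas(frames):
    """Central differences (c[t+1] - c[t-1]) / 2, with edge frames replicated."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def add_derivatives(static):
    """Append first- and second-order derivatives to each static frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Five dummy frames of 13 static coefficients (12 MFCC + energy).
static = [[float(t + k) for k in range(13)] for t in range(5)]
features = add_derivatives(static)
print(len(features), len(features[0]))  # 5 39
```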

2.3  Acoustic  models 

250  calls  from  IDEAL  (about  9000  sentences,  con- 
taining up  to  13  hours  of  speech  for  each  language) 
have  been  used  for  acoustic  model  training.  First,  4 
language-specific phone sets for English, French, German,
and Spanish were trained. All acoustic models
are three-state continuous-density HMMs of
context-independent phones. Then a single multi-lingual set
of  91  monophone  models  was  obtained  by  an  agglom- 
erative  hierarchical  clustering  of  these  4 phone  sets, 
using  a measure  of  similarity  between  phones  [1].  This 
phone  set  has  proven  to  allow  effective  extension  to 
new  languages  [4]. 
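
As an illustration of the clustering step, the following sketch merges the closest clusters until a target number remains. The similarity measure of [1] is not reproduced here: the average pairwise Euclidean distance between made-up feature vectors, and all labels, are assumptions chosen only to make the example run.

```python
import math

def cluster_phones(vectors, target):
    """Greedy agglomerative clustering until `target` clusters remain.

    vectors: dict mapping a phone label to a feature vector (list of floats).
    The distance between two clusters is the average pairwise Euclidean
    distance between their members (an assumed stand-in measure).
    """
    clusters = [[name] for name in vectors]

    def dist(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(math.dist(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > target:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical language-tagged phones with toy 2-dimensional vectors.
vectors = {
    "en:s": [0.0, 0.0], "fr:s": [0.1, 0.0], "de:s": [0.0, 0.1],
    "en:a": [5.0, 5.0], "fr:a": [5.1, 5.0],
}
print(cluster_phones(vectors, target=2))
```

In the system described above, the inputs would be the phones of the 4 language-specific sets and the target would be 91 shared models.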

2.4  Phonotactic  models 

Phonotactic  models  were  estimated  on  the  sponta- 
neous speech  part  of  the  250  training  calls  which 
accounts for about 15% of the IDEAL corpus. For
each language, an acoustic-phonetic decoding of the
training database was performed using the multilingual
phone set. The decoded phone strings were then
used to estimate language-dependent bigram models
for  English,  French  and  German. 
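
Bigram estimation from decoded phone strings can be sketched as follows. The add-one smoothing is an assumption made so that unseen phone pairs keep a non-zero probability; the smoothing actually used in the system is not stated here.

```python
from collections import Counter

def train_bigram(sequences, phone_set):
    """Estimate P(cur | prev) with add-one smoothing over `phone_set`."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    size = len(phone_set)
    return {
        (p, c): (pair_counts[(p, c)] + 1) / (prev_counts[p] + size)
        for p in phone_set
        for c in phone_set
    }

# Toy decoded string over a two-phone inventory.
model = train_bigram([["a", "b", "a", "b"]], phone_set=["a", "b"])
print(model[("a", "b")])
```

One such model would be trained per language, each on the phone strings decoded from that language's training calls.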

3 TEST  CORPORA 

Experiments  were  conducted  on  the  SQALE  and 
MIST  databases  for  LID  results  on  native  and  non- 
native speech,  respectively. 

3.1  SQALE  database 

The  development  and  test  data  of  the  SQALE  project 
[6]  were  used  for  the  native  speech  experiments.  The 
4-language  (French,  British  and  American  English, 
German)  speech  database  contains  400  sentences  per 
language  from  40  speakers,  plus  some  diagnostic 
sentences  which  were  not  used  in  our  experiments. 
Within  the  SQALE  project  the  test  sentences  were 
chosen  to  give  a reasonable  spread  of  difficulty  as  de- 
termined by  sentence  length  and  perplexity.  French, 
English  (British  or  American)  and  German  speak- 
ers were  recorded  reading  newspaper  texts  from  Le 
Monde,  Wall  Street  Journal  and  Frankfurter  Rund- 
schau, respectively. 

3.2  MIST  database 

The  MIST  database  was  developed  by  the  TNO  Hu- 
man Factors  Research  Institute  to  support  research 
in multi-linguality and non-native speech. 74 native
Dutch speakers (52 male, 22 female) each uttered 10
sentences in Dutch and, for most of them, also in English,
French and German: 5 sentences per language
identical for all speakers and 5 unique sentences per
language and per speaker. For English, French and German,
the text sources are the same as for the SQALE project.
We used only the unique sentences
for evaluation on non-native speech because the identical
sentences are not phonetically balanced over time. The
resulting selected part of the MIST database contains
about 300 sentences per language.

4 EXPERIMENTAL  RESULTS 

We  present  LID  error  rates  for  each  language  as  a 
function of sentence duration. Every second, the system
makes a decision on the speech segment decoded
so far. For a given test duration, only sentences longer
than this duration were used. In order to reduce duration
variability due to pauses and hesitations, the
silences labelled by the recognizer are discounted from
the sentence duration. For both test corpora, the mean
sentence duration is about 6 seconds. Few sentences
are more than 8 s long, and no significant LID results
could be obtained for longer segments.
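
The evaluation protocol can be sketched as below. The data layout (one hypothesis per elapsed second of non-silence speech) and the function name are assumptions used for illustration; the example decisions are made up.

```python
def error_rate_by_duration(sentences, max_seconds):
    """Error rate per test duration.

    sentences: list of (true_language, decisions), where decisions[k] is the
    hypothesis after k + 1 seconds of speech (silences already discounted).
    A sentence only counts for durations it is long enough to cover.
    """
    rates = {}
    for d in range(1, max_seconds + 1):
        usable = [(ref, hyp) for ref, hyp in sentences if len(hyp) >= d]
        if not usable:
            continue
        errors = sum(1 for ref, hyp in usable if hyp[d - 1] != ref)
        rates[d] = errors / len(usable)
    return rates

# Toy decisions for three sentences of 3, 2 and 4 seconds.
sentences = [
    ("fr", ["en", "fr", "fr"]),
    ("en", ["en", "en"]),
    ("de", ["fr", "fr", "de", "de"]),
]
print(error_rate_by_duration(sentences, max_seconds=4))
```

Because the pool of usable sentences shrinks as the duration grows, the rates at the longest durations rest on few sentences.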

4.1  Results  on  native  speech 

Identification results (on a second-by-second basis)
on the SQALE database are provided in Figure 2. For
5-second segments, the global error rate amounts to
10%. This global rate does not show the disparity
between  languages;  indeed,  error  rates  of  16%,  3% 
and  10%  are  achieved  for  English,  French  and  Ger- 
man speech,  respectively.  For  all  durations,  results 
on  French  are  significantly  better  than  on  the  other 
languages.  This  might  be  attributed  to  the  difference 
between  French  national  and  international  telephone 
networks.

Figure 2: LID error rates for the native language task
(SQALE database) as a function of segment duration.


4.2  Results  on  non-native  speech 

Similar  experiments  were  conducted  on  non-native 
speech.  The  identification  results  using  the  three  non- 
native MIST  languages  are  illustrated  in  Figure  3. 
On 5-second segments, LID error rates for non-native
English,  French  and  German  are  23%,  29%  and  31%, 
respectively.  The  global  LID  error  rate  of  the  three 
non-native  languages  is  28%. 

Figure 3: LID error rates for the non-native language
task (MIST database) as a function of segment duration.

4.3 Comparison between native and
non-native speech

The comparison of the identification results for native
and non-native speech for each language is illustrated
in Figure 4. For French and German, the non-native
Dutch accent increases the error rates as expected.
The error rate increase for non-native English, though
significant, is much lower. The English phonotactic
model seems to be more robust with respect to accent
variation. Another, more linguistically motivated
explanation is that Dutch speakers are
better at speaking English than French or
German. For 5-second segments, the global error rate
amounts to 10% for native speech and to 28% for
non-native speech, showing a substantial increase in error
rate (cf. Table 1).

Figure 4: LID error rate comparison between native and non-native speech for English, French and German as
a function of segment duration.
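
The relative increases can be checked directly from the 5-second error rates quoted in Sections 4.1 and 4.2. The short computation below is only a worked check; the function name is an assumption.

```python
# 5-second error rates in percent: (native SQALE, non-native MIST).
rates = {
    "English": (16, 23),
    "French": (3, 29),
    "German": (10, 31),
    "Global": (10, 28),
}

def relative_increase(native, non_native):
    """Multiplicative factor between non-native and native error rates."""
    return non_native / native

for language, (native, non_native) in rates.items():
    print(f"{language}: x{relative_increase(native, non_native):.1f}")
```

The French factor is by far the largest because the native French error rate is very low, so even a non-native rate comparable to the other languages gives a large ratio.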

4.4  Adaptation  of  acoustic  models 

Better results on non-native speech should be obtained
after adapting the LID system to the new
conditions. Given the small size of the available non-native
speech material (the MIST test database), an adaptation
of the phonotactic models does not seem possible,
and only acoustic model adaptation was tested.
To make better use of the available data, the non-native
MIST data were jack-knifed into 5 sets; the results were
obtained by testing each set with acoustic models
adapted on the remaining part of the database.

Table 1: Per language and global LID error rates
on native speech (SQALE database) and non-native
speech (MIST database) for 5 seconds of speech.

              SQALE   MIST   relative increase
English        16%    23%    x1.4
French          3%    29%    x10
German         10%    31%    x3.1
Global rate    10%    28%    x2.8
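
The jack-knife procedure over 5 sets can be sketched as follows. How the sets were actually formed (for instance per speaker or per sentence) is not stated, so the interleaved split below is an assumption.

```python
def jackknife_splits(items, n_sets=5):
    """Yield (adaptation, test) pairs, each set serving once as test data."""
    folds = [items[i::n_sets] for i in range(n_sets)]
    for k in range(n_sets):
        test = folds[k]
        adapt = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield adapt, test

# Ten placeholder sentence identifiers.
items = list(range(10))
for adapt, test in jackknife_splits(items):
    print(len(adapt), len(test))
```

Every sentence is tested exactly once, and never with models adapted on that same sentence.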

Each non-native sentence of the adaptation subset
is aligned with the original prompt using the
language-dependent acoustic models; this produces
a phone segmentation, which is converted into the
language-independent phone set. For each of the
three non-native languages, the acoustic models (including
means, variances and weights of the Gaussians)
are adapted towards the non-native acoustic realization
of the phones. As a result, we get three sets of
acoustic models. A weighting factor controls
the degree of adaptation.
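
The role of a weighting factor can be illustrated with a MAP-style update of a single Gaussian mean. This is an assumed stand-in: the adaptation technique is not specified here, and the real system also adapts variances and mixture weights. The name `tau` for the weighting factor is hypothetical.

```python
def adapt_mean(prior_mean, frames, tau):
    """Interpolate a prior mean with adaptation frames.

    tau weights the prior: a large tau keeps the mean close to the native
    model, while a tau near zero moves it to the adaptation-data average.
    """
    n = len(frames)
    return [
        (tau * m + sum(f[d] for f in frames)) / (tau + n)
        for d, m in enumerate(prior_mean)
    ]

# Two made-up one-dimensional adaptation frames.
print(adapt_mean([0.0], [[2.0], [2.0]], tau=2.0))  # [1.0]
```

With few adaptation frames for a phone, the prior dominates, which limits the risk of over-fitting on a small corpus.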

The  LID  system  with  adapted  acoustic  models  is 
finally  tested  on  the  left-out  fifth  of  the  database. 
Each  test  sentence  is  decoded  using  the  three  adapted 
acoustic  models  in  parallel  with  the  original  multi- 
lingual phone  set,  and  the  four  phone  sequences  ob- 
tained are  scored  with  the  phonotactic  models.  The 
chosen  language  is  the  one  with  the  highest  global 
probability. 

Figure 5 shows the global LID error rates after
adaptation of the acoustic models. On 5-second segments,
LID error rates of 21%, 22% and 27% are
achieved for non-native English, French and German
respectively (these figures can be compared to those
in Table 1). A 14% relative decrease of the global LID
error rate is observed for the three non-native languages
(24% with adaptation vs. 28% without adaptation);
despite the small size of the test set, this improvement
can be shown to be significant using McNemar's test [2].

Figure 5: Global LID error rate for the native language
task (SQALE database) and for the non-native
language task (MIST database) before and after
adaptation of acoustic models, as a function of segment
duration.
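
For reference, an exact two-sided McNemar test on paired decisions can be sketched as below. Only the sentences on which the two systems disagree enter the test. The counts in the example are made up, since the discordant counts of the experiment are not reported here.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: sentences wrong before adaptation but right after.
    c: sentences right before adaptation but wrong after.
    Under the null hypothesis, each discordant sentence falls in either
    group with probability 1/2.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: 10 sentences fixed by adaptation, none broken.
print(mcnemar_exact(10, 0))
```

Because the test compares the two systems on the same sentences, it can detect a difference on a small test set that an unpaired comparison of error rates would miss.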

5 CONCLUSIONS 

Experiments have been carried out with a
phonotactic LID system on a 3-language
task using native and non-native speech
(SQALE and MIST corpora).

Using  5 seconds  of  telephone  quality  speech,  LID 
error  rate  increased  from  10%  for  native  speech  to 
28%  for  non-native  speech.  Given  the  limited  amount 
of test data, the test segment duration has been limited
to a maximum of 8 seconds, which is
far below the typical durations (30 s and more) for
which the phonotactic LID approach performs best.

Adaptation of the acoustic model sets
significantly reduced the error rate on non-native
speech. With the phonotactic approach, adaptation
of the phonotactic models should be more effective,
but it could not be tested with the databases involved.

The need for further investigation is obvious. Studying
the effects of non-native speech on LID requires
larger databases including more utterances of longer
duration, more languages and various foreign accents.
The development cost of such resources is of
course a major issue. But the MIST database, even
if only devoted to Dutch accent over a few European
languages, was clearly an excellent starting point for
the study of non-native speech, especially because it
matches the previously studied native SQALE
database.